An Amharic Stemmer: Reducing Words to their Citation Forms

Authors

  • Atelach Alemu Argaw
  • Lars Asker
Abstract

Stemming is an important analysis step in a number of areas such as natural language processing (NLP), information retrieval (IR), machine translation (MT) and text classification. In this paper we present the development of a stemmer for Amharic that reduces words to their citation forms. Amharic is a Semitic language with rich and complex morphology. Such a stemmer is intended for dictionary-based cross-language IR, where the translation step requires looking up terms in a machine-readable dictionary (MRD). We apply a rule-based approach supplemented by occurrence statistics of words in an MRD and in a 3.1M-word news corpus. The main purpose of the statistical supplements is to resolve ambiguity between alternative segmentations. The stemmer is evaluated on Amharic text from two domains: news articles and a classic fiction text. It is shown to have an accuracy of 60% for the old-fashioned fiction text and 75% for the news articles.
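The idea of generating alternative segmentations by rule and then choosing among them with dictionary and corpus statistics can be illustrated with a minimal sketch. This is not the authors' implementation: the affix lists, the transliterated toy words, the counts, and the function names `candidate_stems` and `pick_stem` are all hypothetical, chosen only to show how occurrence statistics might break a tie between competing splits.

```python
from collections import Counter

# Hypothetical toy affix lists (assumptions for illustration only,
# not the rule set used in the paper).
PREFIXES = ["", "ye", "be", "le"]
SUFFIXES = ["", "och", "u", "ochu"]


def candidate_stems(word):
    """Return every (prefix, stem, suffix) split allowed by the toy affix lists."""
    candidates = []
    for prefix in PREFIXES:
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for suffix in SUFFIXES:
            if suffix and not rest.endswith(suffix):
                continue
            stem = rest[:len(rest) - len(suffix)] if suffix else rest
            if stem:  # never propose an empty stem
                candidates.append((prefix, stem, suffix))
    return candidates


def pick_stem(word, dictionary, corpus_counts):
    """Choose one stem: dictionary entries first, then higher corpus frequency.

    On a full tie, max() keeps the earliest candidate, which is the
    unsegmented word, so unknown words are returned unchanged.
    """
    def score(candidate):
        _, stem, _ = candidate
        return (stem in dictionary, corpus_counts.get(stem, 0))

    return max(candidate_stems(word), key=score)[1]


# Toy data: both "bet" and "yebet" are dictionary entries, so the word
# "yebetu" is ambiguous and the corpus counts decide.
dictionary = {"bet", "yebet", "lij"}
corpus_counts = Counter({"bet": 120, "yebet": 2, "lij": 80})

print(pick_stem("yebetu", dictionary, corpus_counts))
print(pick_stem("lijoch", dictionary, corpus_counts))
```

In this sketch the dictionary check dominates and the corpus count only breaks ties among dictionary hits; a real system could weight the two sources differently.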


Similar resources

Stemming Hausa text: using affix-stripping rules and reference look-up

Stemming is the process of reducing a derived or inflected word to its root or stem by stripping all its affixes. It has been used as a preprocessing step in applications such as information retrieval, machine translation, and text summarization to increase efficiency. Currently, there are a few stemming algorithms which have been developed for languages such as English, Arabic, Turki...


Stemmer for Serbian language

In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form, generally a written word form. This work presents a suffix-stripping stemmer for the Serbian language, one of the highly inflectional languages.


TelStem: An Unsupervised Telugu Stemmer with Heuristic Improvements and Normalized Signatures

Stemming is a technique for reducing variant forms of a word to their roots (or stems) by extracting common suffixes. A stem need not correspond to the linguistic root of a word. Stemming is predominantly used in IR systems to improve retrieval effectiveness and to reduce the size of the index. This paper presents a systematic way of algorithm and implementati...


Citation Behaviours of Applied Linguists in Discussion Sections of Research Articles

It is now generally accepted that academic writing is a social activity by which authors negotiate with their audience to gain community acceptance for their findings. One way to achieve such acceptance is by establishing intertextual links to prior research using citation. Despite extensive research on the topic and the suggestion of typologies for the form and function of citation in ...


Dictionary-based Amharic - English Information Retrieval

We present two approaches to the Amharic – English bilingual track in CLEF 2004. Both experiments use a dictionary-based approach to translate the Amharic queries into English bags of words, but while one approach removes non-content-bearing words from the Amharic queries based on their IDF value, the other uses a list of English stop words to perform the same task. The resulting translated (En...




Publication date: 2007